Red Wine Quality Analysis

by Roger Duong

V1

May 2018

========================================================

This report explores a dataset containing quality ratings and 11 chemical properties for 1,599 red wines. Below is a summary of the dataset. The label ‘quality’ was converted to an ordered factor to facilitate later plotting.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
summary(wine)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Univariate Plots Section

The following section displays 10 univariate plots, which will help to understand the structure of the individual variables in the wine dataset.

We observe that the majority of wines are rated 5 and 6 with a maximum rating of 8. The purpose of this analysis is to identify which chemical properties (features) influence the quality of wines (label).

The distribution of residual sugar shows multiple outliers, with values above 8. The mode appears to be around 2.

I restricted residual sugar to a shorter interval to get a clearer view of the peak in counts around 2.0.
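The zoom described above can be sketched as follows. This is a minimal illustration, not the report's own plotting code: it uses base R rather than ggplot, the data are a synthetic stand-in for `wine$residual.sugar`, and the interval of 1 to 4 and the bin width are assumptions chosen for the example.

```r
# Synthetic stand-in for wine$residual.sugar (right-skewed with a long tail); not the real data.
set.seed(1)
residual_sugar <- c(rlnorm(1500, meanlog = log(2.2), sdlog = 0.25), runif(99, 4, 15.5))

# Keep only the observations inside the interval of interest before binning.
zoom <- residual_sugar[residual_sugar >= 1 & residual_sugar <= 4]
h <- hist(zoom, breaks = seq(1, 4, length.out = 31), plot = FALSE)

# Midpoint of the tallest bin, i.e. the approximate mode within the interval.
mode_bin <- h$mids[which.max(h$counts)]
print(mode_bin)
```

Setting `plot = TRUE` (the default) would draw the restricted histogram instead of only returning the bin counts.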

Likewise for chlorides, we can observe a long-tailed distribution, with a mode at 0.08.

We can observe that the total sulfur dioxide distribution is positively skewed.

We can observe that the pH distribution has a mode in the region of 3.3, which corresponds to an acidic solution.

We can observe that the sulphate distribution is positively skewed, with a mode around 0.6.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 wines in the dataset, with 11 features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. The variable X is just an index number and will not be used for this analysis.

All variables are numerical.

Some observations:
- Most wines are rated 5 or 6
- Wines are acidic, with a mean pH of 3.3
- The alcohol content of the wines averages about 10%

What is/are the main feature(s) of interest in your dataset?

The main features are the acidic properties of the wines (pH, fixed acidity, volatile acidity, citric acid) and alcohol properties of the wines. We will examine how those features can predict the label of wine quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Other features, such as residual sugar and sulfur dioxide, will help examine the relationship between chemical properties and wine quality.

Did you create any new variables from existing variables in the dataset?

I created the new variable ‘other.sulfur.dioxide’, defined as the difference between the total sulfur dioxide and the free sulfur dioxide.
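As a small illustration of this derived variable, the sketch below computes it for the first three rows shown in the dataset preview. The data frame name `wine_head` is made up for the example.

```r
# First three rows of the two sulfur dioxide columns, copied from the dataset preview.
wine_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15),
  total.sulfur.dioxide = c(34, 67, 54)
)

# Sulfur dioxide that is not free = total minus free.
wine_head$other.sulfur.dioxide <-
  wine_head$total.sulfur.dioxide - wine_head$free.sulfur.dioxide
print(wine_head)
```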

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I converted the label ‘quality’ from a numerical type to an ordered factor.
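A minimal sketch of such a conversion is shown below, using the first ten ratings from the `str()` output. The explicit level range 3 to 8 is an assumption based on the minimum and maximum in the summary, and the name `quality_f` is made up for the example.

```r
# First ten quality ratings, as listed in the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Declaring all levels up front keeps ratings that are absent from a subset (here 3, 4 and 8).
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

print(levels(quality_f))
print(quality_f[1] < quality_f[4])  # order comparisons are defined for ordered factors
```

An ordered factor lets plotting functions treat each rating as a discrete group while keeping the ranking, which is also what an ordinal regression expects as a response.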

Bivariate Plots Section

This section describes the relationships between 2 variables. It starts by plotting a pair plot, which will help to visualize correlations between variables.

Using the visualization from the pair plot, I propose to investigate further the correlation between:
- quality and volatile acidity
- quality and citric acid
- quality and sulphates
- quality and alcohol

Selecting the largest correlation coefficients from the correlation matrix, I propose to study in more detail the correlations between:
- fixed acidity and citric acid
- volatile acidity and citric acid
- fixed acidity and pH
- citric acid and pH
- free sulfur dioxide and total sulfur dioxide
- fixed acidity and density
- density and alcohol
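One way to pick out the largest coefficients from a correlation matrix is sketched below. It runs on synthetic data, so the numbers say nothing about the wine dataset. The 0.5 cutoff and the names `cm`, `pairs_df`, and `strong` are assumptions made for the illustration.

```r
set.seed(42)
n <- 500
fixed_acidity <- rnorm(n, 8.3, 1.7)

# Synthetic data: pH and density are built to depend on fixed acidity; alcohol is pure noise.
df <- data.frame(
  fixed.acidity = fixed_acidity,
  pH      = 3.3 - 0.06 * (fixed_acidity - 8.3) + rnorm(n, 0, 0.1),
  density = 0.9967 + 0.0007 * (fixed_acidity - 8.3) + rnorm(n, 0, 0.0013),
  alcohol = rnorm(n, 10.4, 1.1)
)

cm <- cor(df)

# Keep each pair once (upper triangle) and rank the pairs by absolute correlation.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]

# Pairs above the (arbitrary) cutoff are the candidates for a closer look.
strong <- pairs_df[abs(pairs_df$r) > 0.5, ]
print(strong)
```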

We can observe that the interquartile ranges vary between wines of different quality. This should be explored further with an ordinal logistic regression later.
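The comparison read off the boxplots can also be tabulated. The sketch below computes the median and interquartile range per quality rating on synthetic data. The variable `feature` is made up and does not correspond to any column of the wine dataset.

```r
set.seed(7)
quality <- factor(sample(3:8, 300, replace = TRUE), levels = 3:8, ordered = TRUE)

# Made-up feature whose centre drifts down and whose spread shrinks as quality rises.
q_num <- as.integer(as.character(quality))
feature <- rnorm(300, mean = 1.0 - 0.08 * q_num, sd = 0.30 - 0.03 * q_num)

# One value per quality level: the same numbers a boxplot draws as the box and its midline.
iqr_by_quality    <- tapply(feature, quality, IQR)
median_by_quality <- tapply(feature, quality, median)
print(round(rbind(median = median_by_quality, IQR = iqr_by_quality), 3))
```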

We can also observe that citric acid levels vary from wines of poor quality to wines of high quality. We will investigate this further later.

We can observe that high quality wines (7 and 8) have higher alcohol content generally above 11%.

We can observe a good correlation between fixed acidity and citric acid.

We can observe a good correlation between volatile acidity and citric acid.

We can observe a very good (negative) correlation between fixed acidity and pH. This is expected, as pH decreases when the acidity of a solution increases.

We can observe a very good (negative) correlation between citric acid and pH. This is expected for the same reason.

We can observe a very good correlation between total sulfur dioxide and free sulfur dioxide. This is expected, as the total sulfur dioxide includes the free sulfur dioxide.

We can observe good correlation between fixed acidity and density.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Using the visualization from the pair plot, we can observe good correlation between:
- quality and volatile acidity
- quality and citric acid
- quality and sulphates
- quality and alcohol

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The scatter plots allowed us to observe good correlation between the following features:
- fixed acidity and citric acid
- volatile acidity and citric acid
- fixed acidity and pH
- citric acid and pH
- free sulfur dioxide and total sulfur dioxide
- fixed acidity and density
- density and alcohol

Some correlations are expected, as some of the chemical properties are chemically dependent on each other:
- fixed acidity and pH
- citric acid and pH
- free sulfur dioxide and total sulfur dioxide

As such, if we were to use them in a prediction model, we should be cautious about mistakenly showing spurious correlations.

What was the strongest relationship you found?

The strongest relationship with the quality of wine was found for volatile acidity.

Multivariate Plots Section

In this section, we plot scatterplots between features that showed strong relationships previously and add a third variable, the main label of interest (quality).

We can observe on the previous plots that wines of good quality (7 and 8, in green) are generally clustered separately from wines of low quality (3 and 4, in red and orange).

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The relationships between the label quality and both volatile acidity and alcohol were strengthened.

Were there any interesting or surprising interactions between features?

The relationship between density and fixed acidity showed an interaction.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

We created an ordinal logistic regression model using the MASS library. The purpose of the model is to predict the wine quality using the features available in the dataset.

The relationship studied was: quality ~ volatile.acidity + alcohol + fixed.acidity + density + citric.acid + sulphates
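For reference, the proportional-odds form that `polr` fits for this formula can be written as below. The symbols $\zeta_j$ (cut-points, reported by `polr` as intercepts such as `3|4`) and $\beta_i$ (coefficients) are notation introduced here, following the parameterisation in the MASS documentation.

```latex
\operatorname{logit} P(\text{quality} \le j \mid x) = \zeta_j - \eta,
\qquad j \in \{3, 4, 5, 6, 7\},
\qquad
\eta = \beta_1\,\text{volatile.acidity} + \beta_2\,\text{alcohol}
     + \beta_3\,\text{fixed.acidity} + \beta_4\,\text{density}
     + \beta_5\,\text{citric.acid} + \beta_6\,\text{sulphates}
```

Because $\eta$ enters with a minus sign, a positive coefficient shifts probability towards higher quality ratings, and a negative one towards lower ratings.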

We split the dataset into test and training sets. We then created several models by adding features one by one, in order to study the influence of those features on our ability to predict the wine quality. The metric used to evaluate the performance of the models was accuracy.

We included a model that uses all features as variables to our model.
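A split in which the test set consists exactly of the rows left out of the training sample can be sketched in base R as follows. This is an illustration under assumptions, not the code used to produce the results in this report: `wine_demo` is a stand-in data frame holding only the index column, and the `_demo` names are made up.

```r
set.seed(1000)
wine_demo <- data.frame(X = 1:1599)  # stand-in with the same number of rows as the dataset

# Draw the training row indices once; every remaining row goes to the test set.
train_idx <- sample(nrow(wine_demo), size = round(0.75 * nrow(wine_demo)))
wine_train_demo <- wine_demo[train_idx, , drop = FALSE]
wine_test_demo  <- wine_demo[-train_idx, , drop = FALSE]

print(c(train = nrow(wine_train_demo), test = nrow(wine_test_demo)))
print(length(intersect(wine_train_demo$X, wine_test_demo$X)))  # no shared rows
```

Keeping the two sets disjoint means the accuracy measured on the test set is not inflated by rows the model has already seen during training.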

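# Splitting test and train sets
library(MASS)   # polr()
library(dplyr)  # sample_frac()
set.seed(1000)
wine_train <- sample_frac(wine, 0.75)
# Note: this second sample is drawn independently of wine_train, so the two sets can overlap.
wine_test <- sample_frac(wine, 0.25)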

# Models adding one variable at a time

m1 <- polr(quality ~ volatile.acidity, data = wine_train)
m2 <- update(m1, ~ . + alcohol)
m3 <- update(m2, ~ . + density)
m4 <- update(m3, ~ . + fixed.acidity)
m5 <- update(m4, ~ . + sulphates)
m6 <- update(m5, ~ . + citric.acid)

# Model with all variables
m7 <- polr(quality ~ volatile.acidity + alcohol + density + fixed.acidity + sulphates + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + other.sulfur.dioxide + pH, data = wine_train)
summary(m7)
## 
## Re-fitting to get Hessian
## Call:
## polr(formula = quality ~ volatile.acidity + alcohol + density + 
##     fixed.acidity + sulphates + citric.acid + residual.sugar + 
##     chlorides + free.sulfur.dioxide + other.sulfur.dioxide + 
##     pH, data = wine_train)
## 
## Coefficients:
##                          Value Std. Error t value
## volatile.acidity     -3.405290   0.468934 -7.2618
## alcohol               0.911891   0.069009 13.2142
## density               2.526928   1.139739  2.2171
## fixed.acidity         0.035362   0.059677  0.5926
## sulphates             2.739864   0.400287  6.8447
## citric.acid          -0.817503   0.542019 -1.5083
## residual.sugar        0.059375   0.044569  1.3322
## chlorides            -4.668701   1.483980 -3.1461
## free.sulfur.dioxide  -0.001046   0.006263 -0.1670
## other.sulfur.dioxide -0.011570   0.002696 -4.2917
## pH                   -1.842257   0.583175 -3.1590
## 
## Intercepts:
##     Value   Std. Error t value
## 3|4 -0.6275  1.1673    -0.5375
## 4|5  1.2790  1.1653     1.0976
## 5|6  5.0865  1.1662     4.3618
## 6|7  7.8978  1.1773     6.7082
## 7|8 10.8068  1.2058     8.9621
## 
## Residual Deviance: 2302.12 
## AIC: 2334.12
results <- data.frame(wine_test$quality)
colnames(results)[1] <- 'actual'

results$predicted.1 <- ordered(predict(m1, newdata = wine_test), levels = c(3, 4, 5, 6, 7, 8))
results$predicted.2 <- ordered(predict(m2, newdata = wine_test), levels = c(3, 4, 5, 6, 7, 8))
results$predicted.3 <- ordered(predict(m3, newdata = wine_test), levels = c(3, 4, 5, 6, 7, 8))
results$predicted.4 <- ordered(predict(m4, newdata = wine_test), levels = c(3, 4, 5, 6, 7, 8))
results$predicted.5 <- ordered(predict(m5, newdata = wine_test), levels = c(3, 4, 5, 6, 7, 8))
results$predicted.6 <- ordered(predict(m6, newdata = wine_test), levels = c(3, 4, 5, 6, 7, 8))
results$predicted.7 <- ordered(predict(m7, newdata = wine_test), levels = c(3, 4, 5, 6, 7, 8))

The following shows the actual quality and the quality rating predicted by the 7 models we created.

head(results)
##   actual predicted.1 predicted.2 predicted.3 predicted.4 predicted.5
## 1      6           5           5           5           5           5
## 2      4           6           5           5           5           5
## 3      6           6           6           6           6           6
## 4      5           5           5           6           5           5
## 5      6           5           5           5           5           5
## 6      6           5           6           6           6           6
##   predicted.6 predicted.7
## 1           6           6
## 2           5           5
## 3           6           6
## 4           5           5
## 5           5           5
## 6           6           6

The following shows the accuracy score for each of the 7 models we created.

misClasificError <- c(mean(results$predicted.1 != results$actual),
                      mean(results$predicted.2 != results$actual),
                      mean(results$predicted.3 != results$actual),
                      mean(results$predicted.4 != results$actual),
                      mean(results$predicted.5 != results$actual),
                      mean(results$predicted.6 != results$actual),
                      mean(results$predicted.7 != results$actual))
print(1-misClasificError)
## [1] 0.4775 0.5725 0.5650 0.5450 0.5700 0.5725 0.5800

We can observe that the performance of the models is fairly weak, with accuracy scores between 0.48 and 0.58: at best, the quality is predicted correctly a little under 60% of the time. This is still higher than uniform random guessing at 16.7%, as there are 6 possible quality ratings to predict per wine.
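The accuracy calculation and the comparison with chance can be sketched as below, using only the six rows of `results` printed earlier (columns `actual`, `predicted.1`, and `predicted.7`). Six rows are far too few to evaluate a model, so this only illustrates the arithmetic. The majority-class baseline is an additional comparison introduced here and is not part of the original analysis.

```r
# First six rows of `results` as printed earlier in the report.
levels_q <- 3:8
res <- data.frame(
  actual      = ordered(c(6, 4, 6, 5, 6, 6), levels = levels_q),
  predicted.1 = ordered(c(5, 6, 6, 5, 5, 5), levels = levels_q),
  predicted.7 = ordered(c(6, 5, 6, 5, 5, 6), levels = levels_q)
)

# Accuracy = share of rows where the predicted rating equals the actual rating.
pred_cols <- grep("^predicted", names(res), value = TRUE)
accuracy <- sapply(res[pred_cols], function(p) mean(p == res$actual))

# Baseline (assumption, not in the report): always predict the most frequent actual rating.
majority_baseline <- max(table(res$actual)) / nrow(res)
# Uniform guessing over the six possible ratings.
uniform_baseline <- 1 / length(levels_q)

print(round(c(accuracy, majority = majority_baseline, uniform = uniform_baseline), 3))
```

Since most wines are rated 5 or 6, a majority-class baseline on the full test set would be considerably harder to beat than uniform guessing.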

We can observe that the highest gain in accuracy is obtained by adding the variables alcohol (model m2) and sulphates (model m5) to the model. The highest accuracy is achieved by including all variables in the model (model m7) with a score of 0.58.

Model m2, which uses only two variables (volatile acidity and alcohol), comes close to this score with an accuracy of 0.5725.

The limitations of this exercise are:
- the choice of the model: ordinal logistic regression may not be the best-suited model. We could further explore classification models such as Naive Bayes, Decision Trees, Support Vector Machines, or Neural Networks.
- the absence of cross-validation: we trained the model on a single training set. We could further explore this area.
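To give an idea of what cross-validation could look like for this kind of model, here is a minimal k-fold sketch with `polr`. It runs on synthetic data with made-up predictors `x1` and `x2` and four ordered classes. The choice of five folds is an assumption, and nothing here reproduces the report's results.

```r
library(MASS)  # polr(); MASS ships with standard R installations

set.seed(123)
n <- 600
# Synthetic data: a latent score driven by two made-up predictors, cut into 4 ordered classes.
x1 <- rnorm(n)
x2 <- rnorm(n)
latent <- 1.5 * x1 - 1.0 * x2 + rlogis(n)
y <- cut(latent, breaks = c(-Inf, -1.5, 0, 1.5, Inf), labels = 1:4, ordered_result = TRUE)
dat <- data.frame(y, x1, x2)

k <- 5
fold <- sample(rep(1:k, length.out = n))  # random, roughly equal-sized folds

# Each fold is held out once; the model is fitted on the other k - 1 folds.
cv_accuracy <- sapply(1:k, function(i) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit   <- polr(y ~ x1 + x2, data = train)
  pred  <- predict(fit, newdata = test)  # predicted class (the default type)
  mean(as.character(pred) == as.character(test$y))
})

print(round(cv_accuracy, 3))
print(mean(cv_accuracy))
```

The spread of the per-fold accuracies gives a sense of how much a single train/test split can vary, which a single accuracy score cannot show.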


Final Plots and Summary

Plot One

Description One

This plot shows the distribution of volatile acidity for each quality rating. We can visualize with the position of the boxplots that the volatile acidity tends to be lower for higher quality rated wines. Only wines rated 7 and 8 have similar distributions for volatile acidity. This makes the volatile acidity the best candidate for a primary predictor of red wine quality.

Plot Two

Description Two

This plot shows by quality rating, how the wines are distributed in terms of volatile acidity and citric acid concentration.

We can observe that the green points (high quality wines with a rating of 7 or 8) tend to be clustered around high citric acid concentration (0.25 to 0.75 g/L) and low volatile acidity (0.2 to 0.6 g/L). Low quality wines tend to be clustered around low citric acid concentration (below 0.25 g/L) and high volatile acidity (above 0.6 g/L).

This clustering indicates that the combination of volatile acidity and citric acid concentration would make a good predictor of wine quality.

Plot Three

Description Three

This plot shows by quality rating, how the wines are distributed in terms of volatile acidity and alcohol content.

We can observe that the green points (high quality wines with a rating of 7 or 8) tend to be clustered around high alcohol content (10% and above) and low volatile acidity (0.2 to 0.6 g/L). Low quality wines tend to be clustered around low alcohol content (below 10%) and high volatile acidity (above 0.6 g/L).

This clustering indicates that the combination of volatile acidity and alcohol content would make a good predictor of wine quality.

Reflection

In this analysis, there were 1,599 observations with 11 variables to consider. We started by plotting univariate plots to understand the data structure. Then we looked at the relationships between the label under study (quality) and the other features. We isolated the pairs that displayed the strongest relationships on a pair plot. We explored further by plotting bivariate plots and ultimately multivariate plots. We finally built a quick prediction model based on an ordinal logistic regression.

The analysis revealed that volatile acidity and alcohol content were good predictors of the wine quality. The volatile acidity tends to be lower for higher quality rated wines (below 0.6 g/L). This is to be expected, as volatile acidity consists largely of acetic acid (the acid in vinegar), which is associated with an unpleasant taste.

The analysis confirms the influence of alcohol content on the wine quality. This is to be expected, as tasters would expect a red wine to have a stronger body.

The influence of fixed acidity was not clearly demonstrated by the model (m4). This is not surprising, as citric acid - a desired acid for its softer acid taste and pleasant aromatic properties - is only one of its many constituents.

It was surprising to see that chlorides and sulfur dioxides did not appear to be good predictors. Naturally, one would think that those chemical properties, associated with negative perception (taste or health), would be good indicators of lower wine quality.

Overall, acceptable prediction of the wine quality can already be achieved with model m2, which uses only two variables: volatile acidity and alcohol.

There are several avenues for further exploration and improvement of the prediction:
- the influence of other chemical properties: glycerol (a contributor to mouth-feel and texture), amino acids, tannin, minerals, etc.
- the choice of the machine learning model to apply: ordinal logistic regression may not be the best-suited model. We could further explore classification models such as Naive Bayes, Decision Trees, Support Vector Machines, Neural Networks, etc.
- the fine-tuning of the model hyperparameters: no adjustment at all was done in this exercise.
- the cross-validation of the machine learning model: we trained the model on a single training set. We could further explore this area.

Reference

- http://waterhouse.ucdavis.edu/whats-in-wine/
- https://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
- https://stat.ethz.ch/pipermail/r-help/2003-May/033335.html
- https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/
- ggplot, ggally and R documentation